Lesson 4 - TWO VARIABLES


Scatterplots and Perceived Audience Size

Notes:


Scatterplots

Notes:

library(ggplot2)
pf <- read.csv('pseudo_facebook.tsv', sep = '\t')
qplot(x = age, y = friend_count, data = pf)

# idem qplot(age, friend_count, data = pf)

What are some things that you notice right away?

Response:


ggplot Syntax

Notes:

# ggplot works in layers and the axis in the aesthetic function
ggplot(aes(x = age, y = friend_count), data = pf) + geom_point() + xlim(13,90)
## Warning: Removed 4906 rows containing missing values (geom_point).

summary(pf$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   13.00   20.00   28.00   37.28   50.00  113.00

Overplotting

Notes: makes it difficult to tell how many points there are in a region (too many points together) We can use transparency with alpha parameter in the geom_point. 1/20, it takes 20 points to be as dark as 1 point before.

ggplot(aes(x = age, y = friend_count), data = pf) + geom_point(alpha = 1/20) + xlim(13,90)
## Warning: Removed 4906 rows containing missing values (geom_point).

Ages are separated in years but that is not real, we add some jitter, some noise to move the dots a little bit.

ggplot(aes(x = age, y = friend_count), data = pf) + geom_jitter(alpha = 1/20) + xlim(13,90)
## Warning: Removed 5168 rows containing missing values (geom_point).

What do you notice in the plot?

Response:


Coord_trans()

Look up the documentation for coord_trans() and add a layer to the plot that transforms friend_count using the square root function. Create your plot!

Notes:

ggplot(aes(x = age, y = friend_count), data = pf) + 
  #geom_jitter(alpha = 1/20) + 
  # to avoid jitter adding points below zero, which when making the the sqrt will give us imaginary points
  geom_point(alpha = 1/20, position = position_jitter(height = 0)) + 
  xlim(13,90) +
  #scale_y_sqrt()
  coord_trans(y = 'sqrt')
## Warning: Removed 5172 rows containing missing values (geom_point).


Alpha and Jitter

Notes:

ggplot(aes(x = age, y = friendships_initiated), data = pf) + 
  #geom_jitter(alpha = 1/20) + 
  # to avoid jitter adding points below zero
  #geom_point() +
  geom_point(alpha = 1/20, position = position_jitter(height = 0)) + 
  #xlim(13,90) +
  scale_y_sqrt()

  #coord_trans(y = 'sqrt')

Overplotting and Domain Knowledge

Notes:


Conditional Means

Notes:DPLYR PACKAGE

Title A Grammar of Data Manipulation. It’s the next iteration of plyr, focused on tools for working with data frames (hence the d in the name).

It is a fast, consistent tool for working with data frame like objects, both in memory and out of memory.

It has three main goals: * Identify the most important data manipulation verbs and make them easy to use from R. * Provide blazing fast performance for in-memory data by writing key pieces in C++ (using Rcpp) *Use the same interface to work with data no matter where it’s stored, whether in a data frame, a data table or database.

Summarise: works in an analogous way to mutate, except instead of adding columns to an existing data frame, it creates a new data frame. This is particularly useful in conjunction with dplyr as it makes it easy to perform group-wise summaries.

#install.packages('dplyr')
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.4.1
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# dplyr:
#filter(): find rows/cases where conditions are true. Unlike base subsetting, rows where the
#condition evaluates to NA are dropped.
#group_by(): takes an existing tbl and converts it into a grouped tbl where operations are performed "by group". 
#ungroup(): removes grouping.  
#mutate(): adds new variables and preserves existing
#arrange(): Arrange rows by variables
#tbl(): Create a table from a data source
#summarise(): Reduces multiple values down to a single value (in our case will be mean and median). It is typically used on grouped data created by group_by(). The output will have one row for each group.
#transmute(): drops existing variables.

age_groups <- group_by(pf, age)
pf.fc_by_age <- summarise(age_groups,
          friend_count_mean = mean(friend_count),
          friend_count_median = median(friend_count),
          n = n())
head(pf.fc_by_age)
## # A tibble: 6 x 4
##     age friend_count_mean friend_count_median     n
##   <int>             <dbl>               <dbl> <int>
## 1    13          164.7500                74.0   484
## 2    14          251.3901               132.0  1925
## 3    15          347.6921               161.0  2618
## 4    16          351.9371               171.5  3086
## 5    17          350.3006               156.0  3283
## 6    18          331.1663               162.0  5196
# in case it was not ordered
# pf.fc_by_age <- arrange(pf.fc_by_age, age)

Alternative Code Conditional Means

CHAIN OPERATOR (%.%) & MAGRITTR OPERATOR (%>%)

%.% allows us to chain functions onto our data set. Please note that in newer versions of dplyr (0.3.x+), the syntax %.% has been deprecated and replaced with %>%.

The %.% operator in dplyr allows one to put functions together without lots of nested parentheses. The flanking percent signs are R’s way of denoting infix operators; you might have used %in% which corresponds to the match function or %*% which is matrix multiplication. The %.% operator is also called chain, and what it does is rearrange the call to pass its left hand side on as a parameter to the right hand side function. As noted in the documentation this makes function calls read from left to right instead of inside and out.

As mentioned, dplyr is not the only package that has something like this, and according to a comment from Hadley Wickham, future dplyr will use the magrittr package instead, a package that adds piping to R. So let’s look at magrittr! The magrittr %>% operator works much the same way, except it allows one to put ”.” where the data is supposed to go. This means that the data doesn’t have to be the first argument to the function.

Another warning: Version 0.4.0 of dplyr has a bug when using the median function on the summarize layer, depending on the nature of the data being summarized. You may need to cast the data as a numeric (float) type when using it on your local machine, e.g. median(as.numeric(var)).

pf.fc_by_age <- pf %>%
  group_by(age) %>%
  summarise(friend_count_mean = mean(friend_count),
          friend_count_median = median(as.numeric(friend_count)),
          n = n()) %>%
  #just in case it was not ordered
  arrange(age)
  
head(pf.fc_by_age)
## # A tibble: 6 x 4
##     age friend_count_mean friend_count_median     n
##   <int>             <dbl>               <dbl> <int>
## 1    13          164.7500                74.0   484
## 2    14          251.3901               132.0  1925
## 3    15          347.6921               161.0  2618
## 4    16          351.9371               171.5  3086
## 5    17          350.3006               156.0  3283
## 6    18          331.1663               162.0  5196

Create your plot!

# Plot mean friend count vs. age using a line graph.
# Be sure you use the correct variable names
# and the correct data frame. You should be working
# with the new data frame created from the dplyr
# functions. The data frame is called 'pf.fc_by_age'.

# Use geom_line() rather than geom_point to create
# the plot. You can look up the documentation for
# geom_line() to see what it does.
ggplot(aes(x = age, y = friend_count_mean), data = pf.fc_by_age) + 
  geom_line()


Overlaying Summaries with Raw Data

Notes: ggplot 2.0.0 changes the syntax for parameter arguments to functions when using stat = ‘summary’. To denote parameters that are being set on the function specified by fun.y, use the fun.args argument.

ggplot(aes(x = age, y = friend_count), data = pf) + 
  xlim(13,90) +
  geom_point(alpha = 0.05, 
             position = position_jitter(height = 0),
             color = 'orange') +
  coord_trans(y = 'sqrt') +
  geom_line(stat = 'summary', fun.y = mean) +
  #before ggplot 2.0.0 geom_line(stat = 'summary', fun.y = quantile, probs = .1,
  geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = .1),
            linetype = 2, color= 'blue') +
  geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = .9),
            linetype = 2, color= 'red') +
  # median
  geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = .5),
            linetype = 1, color= 'green')
## Warning: Removed 4906 rows containing non-finite values (stat_summary).

## Warning: Removed 4906 rows containing non-finite values (stat_summary).

## Warning: Removed 4906 rows containing non-finite values (stat_summary).

## Warning: Removed 4906 rows containing non-finite values (stat_summary).
## Warning: Removed 5164 rows containing missing values (geom_point).

What are some of your observations of the plot?

Response: Most users have less than 1000 friends

COORD CARTESIAN LAYER:

# To zoom in, the code should use the coord_cartesian(xlim = c(13, 90)) layer rather than xlim(13, 90) layer. Setting limits on the coordinate system will zoom the plot (like you're looking at it with a magnifying glass), and will not change the underlying data like setting limits on a scale will. See difference with next plot, the summaries vary if we limit the scale.

ggplot(aes(x = age, y = friend_count), data = pf) + 
  #xlim(13,70),  ylim = c(0, 1000)) +
  coord_cartesian(xlim = c(13, 70), ylim = c(0, 1000)) +
  geom_point(alpha = 0.05, 
             position = position_jitter(height = 0),
             color = 'orange') +
  # coord_trans(y = 'sqrt') +
  geom_line(stat = 'summary', fun.y = mean) +
  #before ggplot 2.0.0 geom_line(stat = 'summary', fun.y = quantile, probs = .1,
  geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = .1),
            linetype = 2, color= 'blue') +
  geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = .9),
            linetype = 2, color= 'red') +
  # median
  geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = .5),
            linetype = 1, color= 'green')

ggplot(aes(x = age, y = friend_count), data = pf) + 
  xlim(13,70) +  ylim(0,1000) +
  #coord_cartesian(xlim = c(13, 70), ylim = c(0, 1000)) +
  geom_point(alpha = 0.05, 
             position = position_jitter(height = 0),
             color = 'orange') +
  # coord_trans(y = 'sqrt') +
  geom_line(stat = 'summary', fun.y = mean) +
  #before ggplot 2.0.0 geom_line(stat = 'summary', fun.y = quantile, probs = .1,
  geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = .1),
            linetype = 2, color= 'blue') +
  geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = .9),
            linetype = 2, color= 'red') +
  # median
  geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = .5),
            linetype = 1, color= 'green')
## Warning: Removed 10493 rows containing non-finite values (stat_summary).

## Warning: Removed 10493 rows containing non-finite values (stat_summary).

## Warning: Removed 10493 rows containing non-finite values (stat_summary).

## Warning: Removed 10493 rows containing non-finite values (stat_summary).
## Warning: Removed 10927 rows containing missing values (geom_point).

```

Moira: Histogram Summary and Scatterplot

See the Instructor Notes of this video to download Moira’s paper on perceived audience size and to see the final plot.

Notes:


Correlation

Notes: Measure the strength between two variables.

# see ?cor.test, we will be using now the pearson method to calculate the corr coefficient
cor.test(pf$age, pf$friend_count, method = 'pearson')
## 
##  Pearson's product-moment correlation
## 
## data:  pf$age and pf$friend_count
## t = -8.6268, df = 99001, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.03363072 -0.02118189
## sample estimates:
##         cor 
## -0.02740737
#equivalent, using with function --> evaluate an expression in an environment constructed from the data
with(pf, cor.test(pf$age, pf$friend_count, method = 'pearson'))
## 
##  Pearson's product-moment correlation
## 
## data:  pf$age and pf$friend_count
## t = -8.6268, df = 99001, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.03363072 -0.02118189
## sample estimates:
##         cor 
## -0.02740737
# Pearson is the default method

Look up the documentation for the cor.test function.

What’s the correlation between age and friend count? Round to three decimal places. Response: -0.0274 –> there is not meaningful relationshipg between these two veriables. As a rule of thumb, any correlation greater than 0.3 o less than -0.3 is meaningful, but small. 0.5 is moderate, and 0.7 of greater is pretty large.


Correlation on Subsets

Notes:

with(subset(pf, age <= 70), cor.test(age, friend_count))
## 
##  Pearson's product-moment correlation
## 
## data:  age and friend_count
## t = -52.592, df = 91029, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.1780220 -0.1654129
## sample estimates:
##        cor 
## -0.1717245

When age increases, frien_count decreases, but it is important to note that correlation doesn’t mean causation.


Correlation Methods

Notes: There are many other relations some monotonic.

In mathematics, a monotonic function (or monotone function) is a function between ordered sets that preserves or reverses the given order.

with(subset(pf, age <= 70), cor.test(age, friend_count,
     method = 'spearman'))
## Warning in cor.test.default(age, friend_count, method = "spearman"): Cannot
## compute exact p-value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  age and friend_count
## S = 1.5782e+14, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## -0.2552934
# calculating rho

Calculating those coefficients are not as good as creating a scatterplot and computing conditional summaries, these give you a richer understanding.


Create Scatterplots

Notes:

# do not forget the aesthetic wrapper
ggplot(aes(x = www_likes_received, y = likes_received), data = pf) +
  coord_cartesian(xlim = c(0, 5000), ylim = c(0, 10000)) +
  geom_point(color = 'orange') +
  geom_line(stat = 'summary', fun.y = mean) 

#ggplot(aes(x = age, y = friend_count), data = pf) + 
  #xlim(13,90) +
 # coord_cartesian(xlim = c(13, 70), ylim = c(0, 1000)) +
#  geom_point(alpha = 0.05, 
 #            position = position_jitter(height = 0),
#             color = 'orange') +
  # coord_trans(y = 'sqrt') +
 # geom_line(stat = 'summary', fun.y = mean) +
  #before ggplot 2.0.0 geom_line(stat = 'summary', fun.y = quantile, probs = .1,
  #geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = .1),
   #         linetype = 2, color= 'blue') +
#  geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = .9),
 #           linetype = 2, color= 'blue') +
  # median
#  geom_line(stat = 'summary', fun.y = quantile, fun.args = list(probs = .5),
 #           linetype = 1, color= 'blue')
ggplot(aes(x = www_likes_received, y = likes_received), data = pf) +
  geom_point(color = 'orange')


Strong Correlations

Notes: Adding a smoother and set the method to the linear model lm

ggplot(aes(x = www_likes_received, y = likes_received), data = pf) +
  geom_point(color = 'orange') +
  #using 95 percentile
  xlim(0, quantile(pf$www_likes_received, 0.95)) +
  ylim(0, quantile(pf$likes_received, 0.95)) +
  #adding a smoother and set the method to the linear model lm
  geom_smooth(method = 'lm', color = 'red')
## Warning: Removed 6075 rows containing non-finite values (stat_smooth).
## Warning: Removed 6075 rows containing missing values (geom_point).

The correlation coefficient is invariant under a linear transformation of either X or Y, and the slope of the regression line when both X and Y have been transformed to z-scores is the correlation coefficient.

It’s important to note that we may not always be interested in the bulk of the data. Sometimes, the outliers ARE of interest, and it’s important that we understand their values and why they appear in the data set.

What’s the correlation betwen the two variables? Include the top 5% of values for the variable in the calculation and round to 3 decimal places. Strongly correlated because one of the variables is a superset of the other.

cor.test(pf$www_likes_received, pf$likes_received)
## 
##  Pearson's product-moment correlation
## 
## data:  pf$www_likes_received and pf$likes_received
## t = 937.1, df = 99001, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9473553 0.9486176
## sample estimates:
##       cor 
## 0.9479902

Response:


Moira on Correlation

Notes: A way to see how related are the variables for a study. Before starting the study check any pair of variables through correlation.


More Caution with Correlation

Notes: Correlation coefficients can be deceptive if you are not carefuL. Plotting your data will give you another view.

ALR3 PACKAGE

Applied Linear Regression 3rd edition This package is a companion to the textbook S. Weisberg (2005), ``Applied Linear Regression,’’ 3rd edition, Wiley. It includes all the data sets discussed in the book (except one), and a few functions that are tailored to the methods discussed in the book. We willbe using the Mitchel data set (Soil Temperature Data)

#install.packages('alr3')
library(alr3)
## Loading required package: car
## Warning: package 'car' was built under R version 3.4.4
## Loading required package: carData
## Warning: package 'carData' was built under R version 3.4.4
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
data(Mitchell)
?Mitchell

Create your plot!

ggplot(aes(x = Month, y = Temp), data = Mitchell)+
  geom_point()


Noisy Scatterplots

  1. Take a guess for the correlation coefficient for the scatterplot.

  2. What is the actual correlation of the two variables? (Round to the thousandths place)

cor.test(Mitchell$Month, Mitchell$Temp)
## 
##  Pearson's product-moment correlation
## 
## data:  Mitchell$Month and Mitchell$Temp
## t = 0.81816, df = 202, p-value = 0.4142
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.08053637  0.19331562
## sample estimates:
##        cor 
## 0.05747063
# correlation is pretty low

Making Sense of Data

Notes: We are going to consider the x axis discrete as it represents Months

ggplot(aes(x = Month, y = Temp), data = Mitchell) +
  geom_point() +
  scale_x_continuous(breaks = seq(0, 203, 12))

#range Mitchell$Month will give you 0 - 203
# As of ggplot 2.0, you will need to use a scale_x_continuous() layer instead of scale_x_discrete(), since the Month is considered a numeric variable.

A New Perspective

What do you notice? Response: Proportion and scale of the graphic do matter, if we shrink the y axis we´ll see kind of a sinusoidal line. Showing fluctuation of temperature every 12 months.

Watch the solution video and check out the Instructor Notes! Notes:


Understanding Noise: Age to Age Months

Notes: What we have is that the estimated mean friend count for each ages is the true mean, plus some noise. The noise would be more if we have chosen finer bins for age.

ggplot(aes(x = age, y = friend_count_mean), data = pf.fc_by_age) + 
  geom_line()


Age with Months Means

pf$age_with_months <- pf$age + (1 - pf$dob_month / 12)

pf.fc_by_age_months <- pf %>%
  group_by(age_with_months) %>%
  summarise(friend_count_mean = mean(friend_count),
          friend_count_median = median(as.numeric(friend_count)),
          n = n()) %>%
  arrange(age_with_months)
  
head(pf.fc_by_age_months)
## # A tibble: 6 x 4
##   age_with_months friend_count_mean friend_count_median     n
##             <dbl>             <dbl>               <dbl> <int>
## 1        13.16667          46.33333                30.5     6
## 2        13.25000         115.07143                23.5    14
## 3        13.33333         136.20000                44.0    25
## 4        13.41667         164.24242                72.0    33
## 5        13.50000         131.17778                66.0    45
## 6        13.58333         156.81481                64.0    54

Age with Months Alternative solution

age_with_months_groups <- group_by(pf, age_with_months)
pf.fc_by_age_months2 <- summarise(age_with_months_groups,
          friend_count_mean = mean(friend_count),
          friend_count_median = median(as.numeric(friend_count)),
          n = n())
pf.fc_by_age_months2 <- arrange(pf.fc_by_age_months2, age_with_months)
  
head(pf.fc_by_age_months2)
## # A tibble: 6 x 4
##   age_with_months friend_count_mean friend_count_median     n
##             <dbl>             <dbl>               <dbl> <int>
## 1        13.16667          46.33333                30.5     6
## 2        13.25000         115.07143                23.5    14
## 3        13.33333         136.20000                44.0    25
## 4        13.41667         164.24242                72.0    33
## 5        13.50000         131.17778                66.0    45
## 6        13.58333         156.81481                64.0    54

Programming Assignment


Noise in Conditional Means

# Create a new line plot showing friend_count_mean versus the new variable,
# age_with_months. Be sure to use the correct data frame (the one you created
# in the last exercise) AND subset the data to investigate users with ages less
# than 71.
ggplot(aes(x = age_with_months, y = friend_count_mean), data = subset(pf.fc_by_age_months, age_with_months < 71)) +
  geom_line()

# We get a much noiser plot, we are reducing the size of the bin and increasing the number of them, so we have less values to estimate the mean, and much  more noise.

Smoothing Conditional Means

Notes: EXAMPLE OF BIAS VARIANCE TRADE OFF similar to the trade off we’ve made when choosing the bin width of the histograms. One way of doing a better trade off is by using a flexible statistical model to smooth our estimates of conditional means. ggplot make it easier fit such models using geom_smooth. So instead of seeing all this noise, we´ll have a smooth modular function that will fit along the data.

p1 <- ggplot(aes(x = age, y = friend_count_mean), data = subset(pf.fc_by_age, age < 71)) + 
  geom_line()


p2 <- ggplot(aes(x = age_with_months, y = friend_count_mean), data = subset(pf.fc_by_age_months, age_with_months < 71)) +
  geom_line()

p3 <- ggplot(aes(x = round(age / 5) * 5, y = friend_count), data = subset(pf, age < 71)) +
  geom_line(stat = 'summary', fun.y = mean) # because I want to plot the mean friend count

library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
grid.arrange(p2, p1, p3, ncol = 1)

# in p3 we have wider bin widths, we would estimate the mean more precisely but potentially miss important features of the age.
# using geom_smooth default paramenters
# we can see the smoothers for age_with_months and for age. The smoother captures some of the features of the relationship, it doesn´t draw attention to the non-motonic relationship in the low ages well, it also really misses the discontinuity at age 69. This highlights that using models like loess or smoothing splines can be useful. But like nearly model it can be subject to sistematic errors when the true process generating our data isn't so consistent with the model itself.
p1 <- ggplot(aes(x = age, y = friend_count_mean), data = subset(pf.fc_by_age, age < 71)) + 
  geom_line() +
  geom_smooth()


p2 <- ggplot(aes(x = age_with_months, y = friend_count_mean), data = subset(pf.fc_by_age_months, age_with_months < 71)) +
  geom_line() +
  geom_smooth()

p3 <- ggplot(aes(x = round(age / 5) * 5, y = friend_count), data = subset(pf, age < 71)) +
  geom_line(stat = 'summary', fun.y = mean) # because I want to plot the mean friend count

library(gridExtra)
grid.arrange(p2, p1, p3, ncol = 1)
## `geom_smooth()` using method = 'loess'
## `geom_smooth()` using method = 'loess'

# in p3 we have wider bin widths, we would estimate the mean more precisely but potentially miss important features of the age.

Which Plot to Choose?

Notes: you don´t have to use, in EDA we´ll often create multiple visualizations and summaries of the same data, gleaning different insights from each.


Analyzing Two Variables

Reflection:


Click KnitHTML to see all of your hard work and to have an html page of this lesson, your answers, and your notes!